Metadata Log Entries metadata table#667
Conversation
pyiceberg/table/snapshots.py
Outdated
| parent_snapshot_id: Optional[int] = Field(alias="parent-snapshot-id", default=None) | ||
| sequence_number: Optional[int] = Field(alias="sequence-number", default=None) | ||
| # cannot import `INITIAL_SEQUENCE_NUMBER` due to circular import | ||
| sequence_number: Optional[int] = Field(alias="sequence-number", default=0) |
There was a problem hiding this comment.
Is there a reason the default value for the sequence number has to be changed to 0 as opposed to None?
There was a problem hiding this comment.
According to the spec, https://iceberg.apache.org/spec/#version-2
Snapshot JSON:
sequence-number was added and is required; default to 0 when reading v1 metadata
Also added this in the PR description
There was a problem hiding this comment.
@kevinjqliu Thanks for spotting this! We definitely need to read snapshot.sequence_number as 0 for v1. However, as we have observed in the test outcome, making sequence_number default to 0 here leads to sequence_number=0 be written to version 1 table metada's snapshots, which is not allowed by spec:
Writing v1 metadata:
Snapshot field sequence-number should not be written
I think we may need a new field_serializer in TableMetadataCommonFields class or some other ways to correct the behavior on write. WDYT?
There was a problem hiding this comment.
Thank you! I missed the part about the V1 spec. Following your suggestion, I added a field_serializer for TableMetadataCommonFields. This will ensure that the Snapshot pydantic object will not have the sequence-number field for V1 format
HonahX
left a comment
There was a problem hiding this comment.
@kevinjqliu Thanks for working on this! It looks geat.
pyiceberg/table/__init__.py
Outdated
| # imitates `addPreviousFile` from Java | ||
| # https://github.com/apache/iceberg/blob/8248663a2a1ffddd2664ea37b45882455466f71c/core/src/main/java/org/apache/iceberg/TableMetadata.java#L1450-L1451 | ||
| metadata_log_entries = self.tbl.metadata.metadata_log + [ | ||
| MetadataLogEntry(metadata_file=self.tbl.metadata_location, timestamp_ms=self.tbl.metadata.last_updated_ms) |
There was a problem hiding this comment.
It seems this line acts more like https://github.com/apache/iceberg/blob/8a70fe0ff5f241aec8856f8091c77fdce35ad256/core/src/main/java/org/apache/iceberg/MetadataLogEntriesTable.java#L62-L66.
Just curious the reason that you mention addPreviousFile here, which seem to be more relevant when we update the metadata_log during table commit.
BTW, this reminds me that currently non-rest catalog does not update the metadata_log field during commit.
There was a problem hiding this comment.
good catch! I link the wrong code. I was trying to find all the places where the metadata log was modified
pyiceberg/table/snapshots.py
Outdated
| parent_snapshot_id: Optional[int] = Field(alias="parent-snapshot-id", default=None) | ||
| sequence_number: Optional[int] = Field(alias="sequence-number", default=None) | ||
| # cannot import `INITIAL_SEQUENCE_NUMBER` due to circular import | ||
| sequence_number: Optional[int] = Field(alias="sequence-number", default=0) |
There was a problem hiding this comment.
@kevinjqliu Thanks for spotting this! We definitely need to read snapshot.sequence_number as 0 for v1. However, as we have observed in the test outcome, making sequence_number default to 0 here leads to sequence_number=0 be written to version 1 table metada's snapshots, which is not allowed by spec:
Writing v1 metadata:
Snapshot field sequence-number should not be written
I think we may need a new field_serializer in TableMetadataCommonFields class or some other ways to correct the behavior on write. WDYT?
HonahX
left a comment
There was a problem hiding this comment.
@kevinjqliu Thanks for the quick update! Just have one minor comment for the test. Overall it looks great!
|
Merged! @kevinjqliu Thanks for another great metadata table work! Thanks @corleyma @syun64 @Fokko for reviewing! |
Resolves #594 (and part of #511)
This PR creates a metadata table for "Metadata Log Entries", similar to its spark equivalent (
metadata_log_entries).To query the metadata table, use
References
MetadataLogEntriesTable.javaMetadata log, append the latest metadata
The metadata log entries append the latest
TableMetadatain its operation (code).This leads to a surprising behavior where the last row of the metadata entries table is based on when the query ran.
For example,
Snapshot
sequence-numberdefault valueThere's an issue with reading V1 spec where the
sequence-numberisNoneinstead of0. According to the Iceberg spec, when reading v1 metadata for v2, theSnapshotfieldsequence-numbermust default to 0 (source).Similarly when writing V1 spec from V2, the
sequence-numbershould not be written. This is achieved by the newfield_serializerfunction inTableMetadataCommonFields.